GH-48251: [C++][CI] Add CSV fuzzing seed corpus generator#48252

Merged
pitrou merged 8 commits into apache:main from
pitrou:gh48251-csv-seed-corpys
Dec 1, 2025

Conversation

@pitrou
Member

@pitrou pitrou commented Nov 25, 2025

Rationale for this change

The CSV seed corpus for fuzzing currently consists of sample data files from the Pandas project and our own testing repository. This PR adds an executable that generates custom seed files with well-defined characteristics designed to exercise the various data types that the CSV reader is able to infer automatically.
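
As a rough illustration of what a generated seed file could contain, here is a minimal sketch. It is not the generator added by this PR: the helper name `MakeSampleCsv`, the column names and the value ranges are all assumptions made for the example, and it uses `std::mt19937` rather than Arrow's own random facilities.

```cpp
#include <cstddef>
#include <cstdint>
#include <iomanip>
#include <random>
#include <sstream>
#include <string>

// Hypothetical helper, not part of Arrow: builds a small CSV document whose
// columns are meant to be inferred as integer, floating-point, boolean and
// string respectively.
std::string MakeSampleCsv(std::uint32_t seed, int num_rows) {
  std::mt19937 rng(seed);
  std::ostringstream out;
  // Fixed notation keeps a decimal point on every float value.
  out << std::fixed << std::setprecision(2);
  out << "ints,floats,bools,strings\n";
  for (int i = 0; i < num_rows; ++i) {
    // Raw engine output is used directly so that results do not depend on
    // the standard library's distribution implementations.
    const std::uint32_t r = static_cast<std::uint32_t>(rng());
    const int int_value = static_cast<int>(r % 2001) - 1000;  // [-1000, 1000]
    const double float_value = static_cast<double>(r % 10000) / 100.0;
    const bool bool_value = ((r >> 16) & 1u) != 0;
    const char letter = static_cast<char>('a' + (r >> 8) % 26);
    const std::size_t length = static_cast<std::size_t>(1 + i % 5);
    out << int_value << ',' << float_value << ','
        << (bool_value ? "true" : "false") << ','
        << std::string(length, letter) << '\n';
  }
  return out.str();
}
```

Calling `MakeSampleCsv(42, 10)` returns a header line followed by ten data rows; the same seed always gives the same text, which is a useful property for a seed corpus.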

This PR also switches the RandomArrayGenerator facility to the non-"fast" PCG random generators, which give better-quality output, especially with respect to the seed. This requires some minor changes in the tests to work around issues that the change of random generator surfaced.

Are these changes tested?

By existing tests.

Are there any user-facing changes?

No.

GeneratorFactory(ValueType min, ValueType max) : min_(min), max_(max) {}

- auto operator()(pcg32_fast* rng) const {
+ auto operator()(pcg32* rng) const {
Member Author

It turns out pcg32_fast is not high quality. When used with RandomArrayGenerator::Strings, the first string character would very often be A...
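
One way such a weakness can arise is sketched below. This is a simplified model, not the PCG library code: it assumes a 64-bit multiplicative congruential generator with an XSH-RS output step, seeded as `seed | 3`, with the output computed from the state before it advances. The struct name `McgXshRs` and the helper `FirstOutput` are hypothetical. The thread does not state the actual cause of the repeated first character, so this only shows the kind of seed sensitivity involved.

```cpp
#include <cstdint>

// Simplified model (an assumption, not the real pcg32_fast implementation):
// a multiplicative congruential generator with an xorshift-high,
// random-shift (XSH-RS) output function.
struct McgXshRs {
  std::uint64_t state;

  // The low two bits are forced to 1 so the state is always odd.
  explicit McgXshRs(std::uint64_t seed) : state(seed | 3u) {}

  std::uint32_t Next() {
    const std::uint64_t old = state;
    // No additive increment: the state is only multiplied.
    state = old * 6364136223846793005ULL;
    // The output is derived from the high bits of the *previous* state.
    const unsigned shift = 22u + static_cast<unsigned>(old >> 61);
    return static_cast<std::uint32_t>((old ^ (old >> 22)) >> shift);
  }
};

// Hypothetical helper: first value produced for a given seed.
inline std::uint32_t FirstOutput(std::uint64_t seed) {
  return McgXshRs(seed).Next();
}
```

In this model, any seed below 2^22 leaves the high bits of the initial state at zero, so `FirstOutput(seed)` is 0 for all such seeds. If that first value selects the first character of a string, every small seed picks the same character. A generator with an additive increment and a seeding step that advances the state before the first output does not have this property.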

@github-actions github-actions bot added the "awaiting committer review" label and removed the "awaiting review" label Nov 25, 2025
@pitrou pitrou force-pushed the gh48251-csv-seed-corpys branch 5 times, most recently from 3451a5c to bcce6c5 Compare November 26, 2025 10:12
@pitrou
Member Author

pitrou commented Nov 26, 2025

@github-actions crossbow submit -g cpp

@github-actions

Revision: bcce6c5

Submitted crossbow builds: ursacomputing/crossbow @ actions-e5f01b72d0

Task Status
example-cpp-minimal-build-static GitHub Actions
example-cpp-minimal-build-static-system-dependency GitHub Actions
example-cpp-tutorial GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind GitHub Actions
test-cuda-cpp-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-fedora-42-cpp GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-bundled GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-bundled-offline GitHub Actions
test-ubuntu-24.04-cpp-gcc-13-bundled GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions
test-ubuntu-24.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-24.04-cpp-thread-sanitizer GitHub Actions

@pitrou pitrou added the "CI: Extra: C++" label Nov 26, 2025
@pitrou pitrou marked this pull request as ready for review November 26, 2025 11:56
@pitrou pitrou requested a review from zanmato1984 November 26, 2025 12:00
@pitrou pitrou force-pushed the gh48251-csv-seed-corpys branch from bcce6c5 to 7d45596 Compare November 27, 2025 14:09
@pitrou
Member Author

pitrou commented Nov 27, 2025

@github-actions crossbow submit -g cpp

@github-actions

Revision: 7d45596

Submitted crossbow builds: ursacomputing/crossbow @ actions-09618dfadc

Task Status
example-cpp-minimal-build-static GitHub Actions
example-cpp-minimal-build-static-system-dependency GitHub Actions
example-cpp-tutorial GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind GitHub Actions
test-cuda-cpp-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-fedora-42-cpp GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-bundled GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-bundled-offline GitHub Actions
test-ubuntu-24.04-cpp-gcc-13-bundled GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions
test-ubuntu-24.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-24.04-cpp-thread-sanitizer GitHub Actions

Contributor

@zanmato1984 zanmato1984 left a comment

Generally lgtm. Some minor questions.

ARROW_ASSIGN_OR_RAISE(auto buffer, WriteRecordBatch(batch, options));

ARROW_ASSIGN_OR_RAISE(auto sample_fn, dir_fn.Join(sample_name()));
std::cerr << sample_fn.ToString() << std::endl;
Contributor

Why use standard error rather than standard out?

Member Author

No precise reason; this is the same thing we're doing in the other fuzz corpus generators.

Comment on lines 45 to 49
read_options.block_size = 1000;
auto parse_options = ParseOptions::Defaults();
auto convert_options = ConvertOptions::Defaults();
convert_options.auto_dict_encode = true;
convert_options.auto_dict_max_cardinality = 50;
Contributor

Why do we need these changes?

Member Author

The block_size change is to increase the likelihood of chunking and the number of chunks, so as to exercise chunked reading and parallelization more. The auto_dict_max_cardinality change just explicitly sets the option to its default value, so it's really a no-op, but it signals a knob that we might want to turn.

Member Author

For the record, most files generated by this PR are 5-10 kB in size.
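
To put the block_size choice and these file sizes together, the arithmetic can be sketched as follows. The function name `EstimateNumBlocks` is hypothetical, and the result is only a rough estimate: it assumes blocks are cut purely by byte count, whereas a real CSV reader has to respect row boundaries.

```cpp
#include <cstdint>

// Hypothetical helper: rough number of blocks an input of `file_size` bytes
// is split into when read in blocks of `block_size` bytes (ceiling division).
// Assumes both arguments are positive.
constexpr std::int64_t EstimateNumBlocks(std::int64_t file_size,
                                         std::int64_t block_size) {
  return (file_size + block_size - 1) / block_size;
}
```

With `block_size = 1000`, files of 5000 to 10000 bytes give about 5 to 10 blocks. With a block size of 1 MiB (assumed here as a typical much larger setting), the same files would each fit in a single block, so chunked reading would hardly be exercised.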

Contributor

@zanmato1984 zanmato1984 left a comment

LGTM

@pitrou pitrou force-pushed the gh48251-csv-seed-corpys branch from 7d45596 to 2f092c2 Compare December 1, 2025 08:34
@pitrou
Member Author

pitrou commented Dec 1, 2025

@github-actions crossbow submit fuzz

@github-actions

github-actions bot commented Dec 1, 2025

Revision: 2f092c2

Submitted crossbow builds: ursacomputing/crossbow @ actions-805c4b6939

Task Status
test-build-cpp-fuzz GitHub Actions

@pitrou pitrou merged commit a32730c into apache:main Dec 1, 2025
46 of 47 checks passed
@pitrou pitrou removed the "awaiting committer review" label Dec 1, 2025
@pitrou pitrou deleted the gh48251-csv-seed-corpys branch December 1, 2025 09:14
@conbench-apache-arrow

After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit a32730c.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 105 possible false positives for unstable benchmarks that are known to sometimes produce them.
